
Simulation parameters

The true prevalence \(\pi\) was set to 0.0957865. Coders’ ability parameters were drawn from the following two Beta distributions:

From these specificity and sensitivity distributions, tuples of ability parameters were randomly drawn for 40 coders. Coders’ simulated abilities have the following empirical distribution.

Simulation results

In total, I have randomly sampled 500 items from a Bernoulli distribution with the simulated \(\pi\) value of 0.0957865. Given a missingness rate of .75 (i.e., each item was judged by only 10 out of 40 coders), I have then generated 10 codings for each of these items. In order to examine how model-based posterior classifications of items perform as a function of the total number of items coded (\(n\)), I have then split the full set of 500 items into nested blocks of 200, 250, 300, 350, 400, and 450 items, such that items in smaller blocks are nested in the respective larger blocks (i.e., all items in the 200 block are also in the 250 block, etc.). This imitates the situation where we collect codings for an increasing number of new items. For the differently \(n\)-sized datasets, we get the following empirical prevalences:
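As an illustrative sketch of this data-generating process (in Python; the seed and the Beta parameters for coder abilities are placeholders, not the values actually used, which are given by the distributions above):

```python
import numpy as np

rng = np.random.default_rng(1)

PI = 0.0957865                      # simulated true prevalence
N_ITEMS, N_CODERS, N_PER_ITEM = 500, 40, 10

# Coder abilities; the Beta parameters below are placeholders.
sensitivity = rng.beta(8, 2, size=N_CODERS)  # P(judge 1 | true 1)
specificity = rng.beta(9, 1, size=N_CODERS)  # P(judge 0 | true 0)

# True item classes.
truth = rng.binomial(1, PI, size=N_ITEMS)

# Each item is judged by a random 10 of the 40 coders (missingness .75).
codings = np.empty((N_ITEMS, N_PER_ITEM), dtype=int)
for i in range(N_ITEMS):
    coders = rng.choice(N_CODERS, size=N_PER_ITEM, replace=False)
    p_positive = np.where(truth[i] == 1,
                          sensitivity[coders], 1 - specificity[coders])
    codings[i] = rng.binomial(1, p_positive)
```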

Empirical prevalence of positive class in differently sized codings datasets
\(n\) Empirical prevalence
200 0.080
250 0.084
300 0.077
350 0.080
400 0.092
450 0.093
500 0.090

In order to examine how repeated coding of items affects model-based classification quality (i.e., increasing the number of judgments aggregated per item, \(n_i\)), I have sampled different numbers of judgments for each item, such that \(n_i \in \{3, 4, \ldots, 10\}\). Again, I have applied a nesting logic when splitting the entire codings dataset. That is, for a given item, all judgments that are in the \(n_i = 3\) subset are also in the \(n_i = 4, \ldots, 10\) subsets, etc. Mirroring the logic of fitting BBA models to differently \(n\)-sized but nested codings datasets, this simulation strategy mimics a situation where one collects increasing numbers of repeated codings from different coders for a given item. This yields the following grid:

No. judgments in codings datasets with varying \(n\) and \(n_i\)
\(n\)/\(n_i\) 3 4 5 6 7 8 9 10
200 600 800 1000 1200 1400 1600 1800 2000
250 750 1000 1250 1500 1750 2000 2250 2500
300 900 1200 1500 1800 2100 2400 2700 3000
350 1050 1400 1750 2100 2450 2800 3150 3500
400 1200 1600 2000 2400 2800 3200 3600 4000
450 1350 1800 2250 2700 3150 3600 4050 4500
500 1500 2000 2500 3000 3500 4000 4500 5000
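The nesting over \(n_i\) and the cell counts in the grid can be sketched as follows (the judgment matrix here is synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the full codings matrix: 500 items x 10 judgments.
full = rng.binomial(1, 0.1, size=(500, 10))

# Nested subsets: an item's n_i-judgment subset is its first n_i
# judgments, so every smaller subset is contained in every larger one.
subsets = {n_i: full[:, :n_i] for n_i in range(3, 11)}

# Each grid cell above is simply n * n_i total judgments.
n_judgments = {(n, n_i): n * n_i
               for n in (200, 250, 300, 350, 400, 450, 500)
               for n_i in range(3, 11)}
```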

Fit results

DIC

  • plot deviances

  • inspect shrinkage and autocorrelation (results not shown)

Prevalence

Inspecting the mixture of chains of posterior estimates of \(\pi\), we see that although for lower values of \(n_i\) there is more variability in estimates, there is no drift in estimates and chains mix nicely.

Moreover, we can see that the posterior densities of all models are by and large close to the simulated prevalence parameter. The pull of the uniform Beta(1,1) prior is stronger in models fitted to small-\(n\) subsets of the codings data, however. Generally, precision increases as \(n\) and \(n_i\) are increased.
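The direction of this prior pull can be illustrated with a conjugate toy case (an assumption for illustration only; the actual BBA posterior for \(\pi\) is not available in closed form): under a Beta(1,1) prior and a binomial likelihood, the posterior mean is \((k+1)/(n+2)\), which pulls the empirical prevalence toward .5 more strongly for small \(n\).

```python
def posterior_mean_toy(k, n):
    """Posterior mean of pi under a Beta(1,1) prior with k positive
    items out of n (conjugate toy case, not the full BBA posterior)."""
    return (k + 1) / (n + 2)

# Empirical prevalences from the table above: 16/200 = .080, 45/500 = .090.
shrink_small_n = posterior_mean_toy(16, 200) - 16 / 200
shrink_large_n = posterior_mean_toy(45, 500) - 45 / 500
# The pull toward .5 is larger in the small-n subset.
```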

Posterior classifications

Turning to posterior class estimates and the posterior classifications induced by assigning items to the median posterior class estimate (identical to assignment based on whether the average posterior class estimate is > .5), a point of particular concern is how certain we are about these classifications.

Posterior classification uncertainty

Here, I define posterior classification uncertainty (PCU) as the standard deviation of posterior class estimates across chains and iterations: \[ \text{SD}(\mathbf c_i) = \sqrt{\frac{\sum_{t=1}^T (c_{it} - \bar{c}_i)^2}{T-1}}, \] where \(t\) indexes the \(t^{th}\) estimate and here \(T = 1000 \times 3\) (iterations times chains). Note that the theoretical maximum of PCU is reached if an item is estimated to be a member of the positive class in exactly \(T/2\) draws, and this maximum approaches .5 as \(T \to \infty\).
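As a quick sketch (function and variable names are mine, not from the model code), the PCU of an item is just the sample standard deviation of its binary class draws:

```python
import numpy as np

def pcu(class_draws):
    """Posterior classification uncertainty: sample SD (T - 1 in the
    denominator) of an item's binary class draws across all chains
    and iterations."""
    return float(np.asarray(class_draws, dtype=float).std(ddof=1))

T = 3000  # 1000 iterations x 3 chains, as above
# Maximal uncertainty: positive in exactly T/2 of the T draws.
most_uncertain = np.repeat([0, 1], T // 2)
# pcu(most_uncertain) is just above .5 and approaches .5 as T grows;
# a constant classification has PCU 0.
```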

A first view at the distributions of PCUs for different values of \(n\) and \(n_i\) illustrates that our posterior classifications are comparatively uncertain if we aggregate only few judgments per item. As \(n_i\) increases, however, the PCU of most items is reduced substantially, and in the extreme case of \(n_i = 10\) it is reduced to negligible levels in virtually all items. We can also see that increasing \(n\) does little to change this pattern.

These results are summarized by the following figure, which plots the change in mean PCU and the standard deviation of PCUs for different combinations of \(n\) and \(n_i\).

While we now see more clearly that differences in \(n\) make no significant difference, neither for changes in the mean PCU (left-hand panel) nor for the standard deviation of PCUs (right-hand panel), for all values of \(n\) there is some reduction in mean PCU as \(n_i\) is increased. Similarly, the standard deviation of PCUs is lower for higher values of \(n_i\). Taken together, both average PCU values and their variability decrease as \(n_i\) is increased.

The above plot says little about the significance of the reductions in mean PCU, however. Therefore, the table below reports the proportion of items whose PCU increases when aggregating \(n_i + l\) instead of \(n_i\) judgments, for \(l \in \{1, \ldots, 7\}\) (displayed in columns 3–9). The proportion of items with a positive PCU change from \(n_i\) to \(n_i + l\) can be interpreted as a significance test: if fewer than 5% of items see an increase, we are 95% confident that collecting and aggregating an additional \(l\) judgments leads to an average decrease in PCU.

Proportion of items with positive change in posterior classification uncertainty
\(n\) \(n_i\) 1 2 3 4 5 6 7
200 3 0.055\(^+\) 0.02\(^*\) 0.015\(^*\) 0.015\(^*\) 0.025\(^*\) 0.02\(^*\) 0.005\(^{**}\)
200 4 0.115 0.055\(^+\) 0.05\(^*\) 0.04\(^*\) 0.03\(^*\) 0.02\(^*\)
200 5 0.105 0.155 0.07\(^+\) 0.035\(^*\) 0.025\(^*\)
200 6 0.225 0.095\(^+\) 0.04\(^*\) 0.02\(^*\)
200 7 0.095\(^+\) 0.04\(^*\) 0.015\(^*\)
200 8 0.03\(^*\) 0.01\(^{**}\)
200 9 0.005\(^{**}\)
250 3 0.044\(^*\) 0.016\(^*\) 0.008\(^{**}\) 0.012\(^*\) 0.032\(^*\) 0.008\(^{**}\) 0.004\(^{**}\)
250 4 0.156 0.064\(^+\) 0.068\(^+\) 0.068\(^+\) 0.04\(^*\) 0.02\(^*\)
250 5 0.112 0.108 0.1\(^+\) 0.072\(^+\) 0.032\(^*\)
250 6 0.14 0.12 0.064\(^+\) 0.028\(^*\)
250 7 0.104 0.052\(^+\) 0.02\(^*\)
250 8 0.044\(^*\) 0.008\(^{**}\)
250 9 0.008\(^{**}\)
300 3 0.08\(^+\) 0.027\(^*\) 0.023\(^*\) 0.017\(^*\) 0.027\(^*\) 0.017\(^*\) 0.007\(^{**}\)
300 4 0.127 0.06\(^+\) 0.073\(^+\) 0.053\(^+\) 0.043\(^*\) 0.013\(^*\)
300 5 0.133 0.11 0.07\(^+\) 0.033\(^*\) 0.01\(^{**}\)
300 6 0.15 0.087\(^+\) 0.033\(^*\) 0.013\(^*\)
300 7 0.083\(^+\) 0.027\(^*\) 0.003\(^{**}\)
300 8 0.023\(^*\) 0\(^{***}\)
300 9 0.01\(^{**}\)
350 3 0.051\(^+\) 0.023\(^*\) 0.023\(^*\) 0.02\(^*\) 0.026\(^*\) 0.009\(^{**}\) 0.006\(^{**}\)
350 4 0.109 0.046\(^*\) 0.037\(^*\) 0.043\(^*\) 0.026\(^*\) 0.017\(^*\)
350 5 0.126 0.111 0.063\(^+\) 0.031\(^*\) 0.017\(^*\)
350 6 0.157 0.094\(^+\) 0.037\(^*\) 0.02\(^*\)
350 7 0.106 0.029\(^*\) 0.014\(^*\)
350 8 0.034\(^*\) 0.009\(^{**}\)
350 9 0.023\(^*\)
400 3 0.14 0.038\(^*\) 0.028\(^*\) 0.028\(^*\) 0.028\(^*\) 0.012\(^*\) 0.005\(^{**}\)
400 4 0.115 0.05\(^*\) 0.045\(^*\) 0.032\(^*\) 0.018\(^*\) 0.005\(^{**}\)
400 5 0.135 0.105 0.052\(^+\) 0.045\(^*\) 0.008\(^{**}\)
400 6 0.178 0.072\(^+\) 0.048\(^*\) 0.012\(^*\)
400 7 0.08\(^+\) 0.048\(^*\) 0.01\(^{**}\)
400 8 0.048\(^*\) 0.01\(^{**}\)
400 9 0.005\(^{**}\)
450 3 0.093\(^+\) 0.02\(^*\) 0.027\(^*\) 0.029\(^*\) 0.018\(^*\) 0.009\(^{**}\) 0.004\(^{**}\)
450 4 0.082\(^+\) 0.058\(^+\) 0.058\(^+\) 0.036\(^*\) 0.02\(^*\) 0.011\(^*\)
450 5 0.171 0.142 0.058\(^+\) 0.033\(^*\) 0.02\(^*\)
450 6 0.202 0.06\(^+\) 0.036\(^*\) 0.018\(^*\)
450 7 0.053\(^+\) 0.033\(^*\) 0.016\(^*\)
450 8 0.038\(^*\) 0.013\(^*\)
450 9 0.029\(^*\)
500 3 0.076\(^+\) 0.016\(^*\) 0.028\(^*\) 0.028\(^*\) 0.018\(^*\) 0.012\(^*\) 0.004\(^{**}\)
500 4 0.112 0.058\(^+\) 0.052\(^+\) 0.036\(^*\) 0.014\(^*\) 0.004\(^{**}\)
500 5 0.172 0.114 0.05\(^*\) 0.028\(^*\) 0.008\(^{**}\)
500 6 0.17 0.074\(^+\) 0.026\(^*\) 0.008\(^{**}\)
500 7 0.07\(^+\) 0.028\(^*\) 0.006\(^{**}\)
500 8 0.028\(^*\) 0.006\(^{**}\)
500 9 0.008\(^{**}\)
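The test behind the table above can be sketched as follows (function and argument names are hypothetical):

```python
import numpy as np

def prop_pcu_increase(pcu_at_ni, pcu_at_ni_plus_l):
    """Proportion of items whose PCU is larger after adding l judgments.
    Values below .05 are read, as in the table above, as 95% confidence
    that the extra judgments decrease PCU on average."""
    a = np.asarray(pcu_at_ni)
    b = np.asarray(pcu_at_ni_plus_l)
    return float(np.mean(b > a))
```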

So, for instance, for \(n = 200\), aggregating four instead of three judgments per item leads to a decrease in PCU in only about 94.5% of cases. We are 98.5% (97.5%) certain, however, that collecting seven (eight) instead of three judgments per item results in an average decrease in item-level PCU, which meets the 5% confidence-level criterion. In order to reduce average PCU levels, it thus seems advisable in most cases to collect multiple judgments per item (i.e., increase \(n_i\)) rather than judgments for new items (i.e., increase \(n\)).

Another point underlining the added value of collecting multiple judgments per item while holding \(n\) constant is illustrated by the following figure. It separates items whose average PCU change, aggregated over all possible judgment increases from \(n_i\) to \(n_i + 1\) for \(n_i \in \{3, \ldots, 9\}\), is positive from all other items. These items are important because a positive average PCU change characterizes items that the models tend to classify erroneously with high posterior certainty when aggregating only few judgments per item, and whose classification becomes on average less certain as we add judgments.

We see indeed that if we collect only few judgments per item (i.e., low values of \(n_i\)), items that are characterized by large PCU when aggregating many judgments per item are roughly proportionally distributed across the entire PCU-value range. Thus, when collecting only few judgments per item, we are not able to separate items with an average increase in PCU as \(n_i\) grows from other, apparently less problematic items. Indeed, such separation is usually only possible for values of \(n_i > 5\) (notably, the size of \(n\) makes no apparent difference in this regard).

Especially worrying is that the posterior classification uncertainty of some items that the model classifies quite confidently based on relatively few judgments per item increases as we collect more judgments for these items: indeed, for these items we observe a sharp increase in PCU as \(n_i\) is increased. Had we not collected more judgments for these items, we would have been led to err in believing that we knew their true classes quite confidently.

Note, however, that the absolute proportion of these items is rather small overall, as the following figure documents.

Indeed, this is good news: increasing \(n_i\) leads to average reduction in PCU for the vast majority of items apparently rather independently from the size of \(n\).

Posterior classification performance

Given that we have simulated the items and thus know their ‘true’ values, we can also assess model performance by computing and comparing conventional classification performance metrics. Specifically, I have first obtained items’ true classes from the simulated codings dataset. Next, I have compared model-based classifications for each value of \(n\) and \(n_i\), each chain, and each iteration to the true classes. Using these data, I have then computed statistics (mean, std. dev., and 5% and 95% percentiles) of the performance measures across chains and iterations for each combination of \(n\) and \(n_i\). The following figure visualizes these statistics for different classification performance metrics. (Note the scaling of the x-axis: 90% of values are in the [.7, 1] range.)

  • Accuracy, defined as the proportion of correctly classified items, increases (weakly) as \(n_i\) is increased. There are no substantial accuracy differentials across values of \(n\), however. (Only the 90%-CIs get tighter as \(n\) is increased, owing to the higher precision that comes from computing accuracy from increasingly larger samples.)
  • Recall (also: true-positive rate, TPR), defined as the number of true-positive classifications over the sum of true-positive and false-negative classifications (i.e., over the number of all positive items), exhibits much more variability across values of \(n_i\).
  • Similarly, precision, defined as the number of true-positive classifications over the sum of all positive classifications (incl. false positives), is relatively low for low \(n_i\) but increases quite rapidly as \(n_i\) is increased. The CIs for precision estimates are comparatively wide, however.
  • In contrast, the TNR, the true-negative rate defined as the number of true-negative classifications over the number of all negative items, is very high already for low values of \(n_i\); accordingly, we observe no substantial improvements in the models’ negative detection rates as the number of judgments aggregated per item is increased. This is because, with a low positive-instance prevalence, our simulated coders in expectation judge true-negative items about four to five times more often than positive instances. There is thus more data, and hence higher precision, in negative detection.
  • Finally, the F1-score, which combines recall and precision into one metric\(^{1}\), increases significantly as \(n_i\) is increased. This is because low precision depresses the F1-score.
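For reference, the metrics in the bullets above can be computed directly from confusion-matrix counts; a minimal sketch with made-up counts for a strongly imbalanced 200-item set:

```python
def classification_metrics(tp, fn, fp, tn):
    """Classification metrics as defined above, from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)        # true-positive rate (TPR)
    precision = tp / (tp + fp)
    tnr = tn / (tn + fp)           # true-negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "TNR": tnr, "F1": f1}

# Illustrative counts only (not results from the fitted models):
m = classification_metrics(tp=12, fn=4, fp=0, tn=184)
```

Note how, with this class imbalance, accuracy is high even though a quarter of the positives are missed; the F1-score registers the weak recall.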

The next table reports significance levels of positive F1-score change as \(n_i\) is increased by \(l\) (columns 3–9). It reports the proportion of iterations across chains for which an increase from \(n_i\) to \(n_i + l\) judgments per item induces a decrease in the F1-score. Again, this can be used as a significance test. The logic is simple: since we want the F1-score to increase with \(l\), the smaller the proportion with an F1-score reduction, the better.

Proportion of iterations with negative change in F1-score
\(n\) \(n_i\) 1 2 3 4 5 6 7
200 3 0.168 0.025\(^*\) 0.009\(^{**}\) 0.006\(^{**}\) 0.002\(^{**}\) 0.001\(^{***}\) 0\(^{***}\)
200 4 0.108 0.072\(^+\) 0.042\(^*\) 0.013\(^*\) 0.007\(^{**}\) 0\(^{***}\)
200 5 0.407 0.199 0.079\(^+\) 0.067\(^+\) 0.006\(^{**}\)
200 6 0.222 0.081\(^+\) 0.06\(^+\) 0.007\(^{**}\)
200 7 0.126 0.098\(^+\) 0.015\(^*\)
200 8 0.126 0.019\(^*\)
200 9 0.02\(^*\)
250 3 0.199 0.042\(^*\) 0.014\(^*\) 0.007\(^{**}\) 0.003\(^{**}\) 0.001\(^{***}\) 0\(^{***}\)
250 4 0.16 0.089\(^+\) 0.041\(^*\) 0.016\(^*\) 0.01\(^{**}\) 0.001\(^{***}\)
250 5 0.339 0.16 0.063\(^+\) 0.054\(^+\) 0.007\(^{**}\)
250 6 0.205 0.078\(^+\) 0.051\(^+\) 0.005\(^{**}\)
250 7 0.141 0.113 0.014\(^*\)
250 8 0.142 0.019\(^*\)
250 9 0.022\(^*\)
300 3 0.204 0.034\(^*\) 0.011\(^*\) 0.004\(^{**}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\)
300 4 0.141 0.069\(^+\) 0.028\(^*\) 0.007\(^{**}\) 0.005\(^{**}\) 0.001\(^{***}\)
300 5 0.327 0.149 0.061\(^+\) 0.037\(^*\) 0.005\(^{**}\)
300 6 0.2 0.078\(^+\) 0.042\(^*\) 0.007\(^{**}\)
300 7 0.146 0.09\(^+\) 0.012\(^*\)
300 8 0.128 0.024\(^*\)
300 9 0.025\(^*\)
350 3 0.106 0.011\(^*\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\)
350 4 0.154 0.041\(^*\) 0.009\(^{**}\) 0.003\(^{**}\) 0.001\(^{***}\) 0\(^{***}\)
350 5 0.264 0.057\(^+\) 0.008\(^{**}\) 0.002\(^{**}\) 0\(^{***}\)
350 6 0.125 0.029\(^*\) 0.007\(^{**}\) 0.002\(^{**}\)
350 7 0.128 0.072\(^+\) 0.048\(^*\)
350 8 0.138 0.094\(^+\)
350 9 0.101
400 3 0.136 0.01\(^{**}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\)
400 4 0.125 0.035\(^*\) 0.007\(^{**}\) 0.002\(^{**}\) 0\(^{***}\) 0\(^{***}\)
400 5 0.259 0.065\(^+\) 0.006\(^{**}\) 0.002\(^{**}\) 0.001\(^{***}\)
400 6 0.141 0.023\(^*\) 0.006\(^{**}\) 0.001\(^{***}\)
400 7 0.117 0.061\(^+\) 0.043\(^*\)
400 8 0.131 0.078\(^+\)
400 9 0.088\(^+\)
450 3 0.135 0.007\(^{**}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\)
450 4 0.084\(^+\) 0.003\(^{**}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\)
450 5 0.098\(^+\) 0.022\(^*\) 0.002\(^{**}\) 0\(^{***}\) 0\(^{***}\)
450 6 0.165 0.012\(^*\) 0.003\(^{**}\) 0.001\(^{***}\)
450 7 0.091\(^+\) 0.046\(^*\) 0.022\(^*\)
450 8 0.135 0.071\(^+\)
450 9 0.083\(^+\)
500 3 0.095\(^+\) 0.002\(^{**}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\)
500 4 0.072\(^+\) 0.003\(^{**}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\) 0\(^{***}\)
500 5 0.087\(^+\) 0.023\(^*\) 0.001\(^{***}\) 0\(^{***}\) 0\(^{***}\)
500 6 0.188 0.019\(^*\) 0.004\(^{**}\) 0.002\(^{**}\)
500 7 0.094\(^+\) 0.044\(^*\) 0.022\(^*\)
500 8 0.123 0.065\(^+\)
500 9 0.078\(^+\)

We see that while it is not possible to assert significant F1-score improvements for higher values of \(n_i\) in models fitted to small-\(n\) subsets (due to smaller sample sizes and hence less power), we have reason to be confident that increasing the number of judgments, if only few have been collected thus far, would lead to improvements in the models’ classification performance.

Another approach to visualizing posterior classification performance follows Carpenter (2008) and plots the (absolute) residual classification error for items differentiating between true-positive and true-negative items:

Again we can generally see that posterior mean estimates perform comparatively poorly in correctly classifying true positives. While most true-negative items have a low absolute residual classification error (ARCE) and are hence correctly classified (correct classification for ARCE <= .5), true-positive items are sometimes misclassified, especially for low values of \(n_i\) (some true-positive items have ARCE > .5). This reflects the generally high true-negative detection ability of the models across values of \(n\) and \(n_i\), in contrast to their comparatively low true-positive detection ability (recall). Yet, though misclassification occurs more often for true-positive items, the confidence in these items’ classifications is limited (most ARCEs are close to the classification threshold of .5). What is more, this proportion is reduced as \(n_i\) is increased, as most absolute residual values approach zero or at least fall below .5.

Because the above plot reports proportions across the ARCE-value range that were computed within item classes (for each \(n\) and \(n_i\) combination), the overall proportions are obscured. Indeed, as true-positive items make up only between 7.7% and 9.3% of all items in the respective subsets, the amount of misclassification resulting from poor true-positive detection rates is actually not too grave, as the following plot shows more accurately. (Note that the x-axis of this plot depicts the real-valued residual classification error, not absolute values.)

In these figures, residuals measure how much the mean posterior class estimate (across chains and iterations) of a model deviates from the true value of 1 (0). That is, residuals from model-based classification of true-positive items are negative in \([-1,0)\), whereas residuals from model-based classification of true-negative items are always positive in \((0,1]\). Actual misclassification of true positives (true negatives) occurs only if the residual is smaller (greater) than -.5 (.5); hence the two vertical dashed lines at -.5 and .5. Indeed, because for each true-positive item there are about 9 true-negative items, misclassification of true-negative items occurs more often in absolute terms, whereas in relative terms the models’ true-negative detection ability (TNR) tends to be better than their true-positive detection ability (recall).
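A sketch of this signed residual (the function name is mine, not from the analysis code):

```python
import numpy as np

def signed_residual(post_draws, true_class):
    """Mean posterior class estimate minus the item's true class:
    in [-1, 0] for true positives, [0, 1] for true negatives;
    |residual| > .5 flags a misclassification."""
    return float(np.mean(post_draws) - true_class)
```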

Comparison to majority-voting based classifications

Disagreements between model-based classifications and majority voting
\(n\) 3 4 5 6 7 8 9 10
200 3 (1.5%) 2 (1%) 2 (1%) 2 (1%) 3 (1.5%) 4 (2%) 4 (2%) 4 (2%)
250 2 (0.8%) 5 (2%) 3 (1.2%) 4 (1.6%) 4 (1.6%) 4 (1.6%) 4 (1.6%) 4 (1.6%)
300 3 (1%) 4 (1.33%) 3 (1%) 4 (1.33%) 4 (1.33%) 5 (1.67%) 4 (1.33%) 4 (1.33%)
350 6 (1.71%) 2 (0.57%) 4 (1.14%) 4 (1.14%) 5 (1.43%) 6 (1.71%) 6 (1.71%) 6 (1.71%)
400 6 (1.5%) 9 (2.25%) 5 (1.25%) 6 (1.5%) 6 (1.5%) 6 (1.5%) 7 (1.75%) 6 (1.5%)
450 11 (2.44%) 8 (1.78%) 6 (1.33%) 7 (1.56%) 8 (1.78%) 8 (1.78%) 9 (2%) 6 (1.33%)
500 13 (2.6%) 8 (1.6%) 7 (1.4%) 8 (1.6%) 9 (1.8%) 9 (1.8%) 9 (1.8%) 9 (1.8%)

Indeed, disagreements occur not too frequently. Alternatively, we can assess agreement by computing correlations between majority voting and model-based classifications:

Correlation between model-based classifications and majority voting
\(n\) 3 4 5 6 7 8 9 10
200 0.900 0.926 0.930 0.930 0.894 0.857 0.857 0.857
250 0.947 0.863 0.920 0.887 0.892 0.892 0.892 0.892
300 0.930 0.902 0.927 0.898 0.902 0.877 0.902 0.902
350 0.882 0.959 0.920 0.914 0.899 0.878 0.878 0.878
400 0.910 0.863 0.924 0.905 0.908 0.908 0.892 0.908
450 0.861 0.893 0.917 0.903 0.891 0.891 0.877 0.919
500 0.845 0.894 0.910 0.897 0.886 0.886 0.886 0.886
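Both comparisons above, disagreement counts and correlations between binary classification vectors, can be sketched as follows (names are hypothetical):

```python
import numpy as np

def compare_to_majority_vote(model_cls, mv_cls):
    """Number and share of items on which model-based and
    majority-voting classifications disagree, plus their Pearson
    (phi) correlation."""
    model_cls = np.asarray(model_cls)
    mv_cls = np.asarray(mv_cls)
    n_disagree = int(np.sum(model_cls != mv_cls))
    phi = float(np.corrcoef(model_cls, mv_cls)[0, 1])
    return n_disagree, n_disagree / model_cls.size, phi
```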
An example row of the per-configuration classification performance results (\(n = 200\), \(n_i = 9\)): Accuracy 0.98, TP 12, FN 4, FP 0, TN 184, TPR 0.75 (the raw output also includes TNR, FPR, FNR, Precision, Recall, and the F1-score).

A striking result of the classification performance of majority voting is that we get no false positives. Hence, both the precision and the TNR are perfect for all models (false positives enter the denominator of both metrics). What is more, because of the strong class imbalance, the accuracies are also very close to perfect. Indeed, the F1-score is a much more informative metric in the presence of such intense class imbalance.

Given that we obtain posterior classifications for each iteration and each chain of each model, we can also compute how confident we can be that the model and majority voting induce different classification performances. These differences are illustrated for F1-scores in the following plot. To obtain 90% confidence bounds, the majority-voting based F1-score point estimates have been subtracted from the scores induced by model-based classifications at each iteration. This yielded 3000 differences, which were then aggregated into means and 90%-CIs.

This shows that while for low \(n_i\) majority voting outperforms model-based classifications in terms of F1-scores for all values of \(n\) (though not always significantly), for higher \(n_i\) the model tends to perform (significantly) better. Specifically, for \(n_i > 6\) (differences for \(n_i = 6\) may be induced by random tie-breaking in majority voting), we are 95% certain for all values of \(n\) that model-based classification yields higher F1-scores than majority voting. Note that the choice of comparing the F1-scores of model-based and majority-voting classifications was theoretically informed. In the presence of such strong class imbalance, accuracy is not a good performance criterion. By taking both precision and recall into account, we can achieve high F1-scores only if our aggregation method also performs reasonably well in correctly classifying true-positive items: a job the BBA model does better than majority voting for \(n_i > 6\) in the simulated data.
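The aggregation of per-iteration differences described above can be sketched as follows (names are hypothetical; `np.quantile` supplies the central 90% bounds):

```python
import numpy as np

def f1_difference_summary(model_f1_draws, mv_f1, level=0.90):
    """Mean and central CI of per-iteration model-based F1-scores
    minus the majority-voting F1 point estimate."""
    diffs = np.asarray(model_f1_draws) - mv_f1
    alpha = (1 - level) / 2
    lo, hi = np.quantile(diffs, [alpha, 1 - alpha])
    return float(diffs.mean()), (float(lo), float(hi))
```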

References

Carpenter, Bob. 2008. “Multilevel Bayesian Models of Categorical Data Annotation.” Unpublished manuscript. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.1374&rep=rep1&type=pdf.


  1. The formula is \(\text{F1-score} = 2\times\frac{\text{Precision}\times \text{Recall}}{\text{Precision} + \text{Recall}}\)